Importing packages and setting the directories
library(lubridate)   #date handling
library(data.table)  #uniqueN() for counting distinct invoices
library(ggplot2)     #plotting (used throughout)
library(plotly)      #interactive plots
library(moments)     #to get skewness
library(factoextra)  #to visualize clusters
library(plyr)        #ddply for cluster summaries
library(car)         #scatter3d for the 3D cluster plot
library(svMisc)
getwd()
[1] "C:/Users/pares/OneDrive/Documents/assessments/online retail"
setwd("C:/Users/pares/OneDrive/Documents/assessments/online retail")
Importing data
data <- read.csv("OnlineRetail.csv",stringsAsFactors = FALSE)
str(data)
'data.frame': 541909 obs. of 8 variables:
$ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
$ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
$ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
$ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
$ InvoiceDate: chr "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ...
$ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
$ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
$ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
We have 541,909 rows. Each row is a single product line within an invoice: if a customer bought 3 different items on one invoice, the data will have 3 separate rows for that invoice.
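A minimal sketch (with made-up values) of this row-per-line-item layout, and how it collapses to one row per invoice:

```r
# Hypothetical line items: one invoice, three products, three rows
toy <- data.frame(
  InvoiceNo = c("536365", "536365", "536365"),
  StockCode = c("85123A", "71053", "84406B"),
  Quantity  = c(6, 6, 8)
)
# Collapsing to the invoice level yields a single row per invoice
aggregate(Quantity ~ InvoiceNo, data = toy, FUN = sum)
```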
We need to correct the data type for InvoiceDate. InvoiceNo remains a character string, since it includes cancelled invoices starting with "C".
#Converting invoice date to date-time (as.POSIXct stores cleanly inside a data frame)
data$InvoiceDate <- as.POSIXct(data$InvoiceDate, format = "%m/%d/%Y %H:%M")
summary(data)
InvoiceNo StockCode Description Quantity InvoiceDate
Length:541909 Length:541909 Length:541909 Min. :-80995.00 Min. :2010-12-01 08:26:00
Class :character Class :character Class :character 1st Qu.: 1.00 1st Qu.:2011-03-28 11:34:00
Mode :character Mode :character Mode :character Median : 3.00 Median :2011-07-19 17:17:00
Mean : 9.55 Mean :2011-07-04 13:59:07
3rd Qu.: 10.00 3rd Qu.:2011-10-19 11:27:00
Max. : 80995.00 Max. :2011-12-09 12:50:00
UnitPrice CustomerID Country
Min. :-11062.06 Min. :12346 Length:541909
1st Qu.: 1.25 1st Qu.:13953 Class :character
Median : 2.08 Median :15152 Mode :character
Mean : 4.61 Mean :15288
3rd Qu.: 4.13 3rd Qu.:16791
Max. : 38970.00 Max. :18287
NA's :135080
Before removing non-existent customers, let's look at the negative values in Quantity and UnitPrice.
sum(data$Quantity < 0)
[1] 10624
10,624 entries have a negative Quantity.
head(data[data$Quantity < 0,])
These could be the cancelled invoices (InvoiceNo starting with "C"): when a customer returns an item, its quantity is recorded as negative. We need a column that separates cancelled invoices from purchases.
#Creating a column that separates purchased items (1) from returned/cancelled items (0)
data$purchased_item <- ifelse(startsWith(data$InvoiceNo, "C"), 0, 1)
Is this the case with unit price too?
subset(data,data$UnitPrice < 0)
These are bad-debt adjustments entered by the operator to fix accounting issues. Since no customer is attached to these rows (their CustomerID is NA), removing the missing-customer rows below eliminates them without harming the analysis.
#Removing non-existent customer IDs
data <- na.omit(data)
How many customers are there in total?
length(unique(data$CustomerID))
[1] 4372
And how many invoices do we have in total?
length(unique(data$InvoiceNo))
[1] 22190
Let's look at our only categorical variable: Country.
sort(table(data$Country))
Saudi Arabia Bahrain Czech Republic Brazil Lithuania
10 17 30 32 35
Lebanon RSA European Community United Arab Emirates Malta
45 58 61 68 127
Greece Canada Iceland Singapore Unspecified
146 151 182 229 244
Israel USA Poland Japan Denmark
250 291 341 358 389
Austria Sweden Cyprus Finland Channel Islands
401 462 622 695 758
Italy Norway Australia Portugal Switzerland
803 1086 1259 1480 1877
Belgium Netherlands Spain EIRE France
2069 2371 2533 7485 8491
Germany United Kingdom
9495 361878
Our customers come from a variety of countries, with the United Kingdom accounting for the vast majority of transactions. Research shows that purchasing behaviour varies across geographical locations, so it is wise to group similar customers together. We will therefore restrict the analysis to UK customers.
data <- subset(data,data$Country == "United Kingdom")
How many customers are we dealing with now?
cat("We have",length(unique(data$CustomerID)),"customers and a total of",length(unique(data$InvoiceNo)),"different invoices")
We have 3950 customers and a total of 19857 different invoices
We can now create our RFM metrics, but we need to account for cancelled orders when calculating the recency and frequency variables.
#-------------------- LET'S RFM --------------------
#Calculate 3 different scores for each customer:
#1. RECENCY - how recently has the customer made a purchase?
#2. FREQUENCY - how frequent is the customer? How many purchases over the given time frame?
#3. MONETARY VALUE - how much revenue does each customer bring in?
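Before computing these on the full data below, here is a toy sketch (hypothetical customers and dates) of what each metric captures:

```r
# Two customers, three invoices (made-up values)
toy <- data.frame(
  CustomerID  = c(1, 1, 2),
  InvoiceNo   = c("A1", "A2", "B1"),
  InvoiceDate = as.Date(c("2011-12-01", "2011-12-05", "2011-11-20")),
  revenue     = c(10, 25, 40)
)
snapshot <- max(toy$InvoiceDate) + 1   # the day after the last transaction
by_cust  <- split(toy, toy$CustomerID)
rfm <- data.frame(
  CustomerID = names(by_cust),
  Recency    = sapply(by_cust, function(d) as.numeric(snapshot - max(d$InvoiceDate))),
  Frequency  = sapply(by_cust, function(d) length(unique(d$InvoiceNo))),
  Monetary   = sapply(by_cust, function(d) sum(d$revenue))
)
rfm  # customer 1: recency 1, frequency 2, monetary 35; customer 2: 16, 1, 40
```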
#Creating a new dataframe
customers <- as.data.frame(unique(data$CustomerID))
colnames(customers) <- "CustomerID"
#------------ RECENCY --------------------
#Days since last purchase, measured from the day after the final invoice in the data.
#difftime with units = "days" avoids any ambiguity about whether the difference comes back in seconds or days.
data$recency <- as.numeric(difftime(max(data$InvoiceDate) + days(1),
                                    data$InvoiceDate, units = "days"))
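As an aside, subtracting POSIXct timestamps directly returns a difftime whose units R picks automatically, which is why pinning units = "days" matters:

```r
t1 <- as.POSIXct("2011-12-10 08:00:00")
t2 <- as.POSIXct("2011-12-09 07:00:00")
t1 - t2                                       # units chosen automatically by R
as.numeric(difftime(t1, t2, units = "days"))  # explicit: gap as a fraction of days
as.numeric(difftime(t1, t2, units = "secs"))  # same gap expressed in seconds
```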
#Considering only the purchases made
temp <- subset(data, data$purchased_item == 1)
recency <- aggregate(recency ~ CustomerID, data = temp, FUN = min, na.rm = TRUE)
customers <- merge(customers, recency, all = TRUE)
remove(recency)
#------------ FREQUENCY ------------------
freq <- aggregate(InvoiceNo ~ CustomerID, data = temp, FUN = uniqueN)
customers <- merge(customers,freq, all = TRUE, by = "CustomerID")
colnames(customers)[3] <- "Frequency"
remove(freq)
#---------- MONETARY VALUE ---------------
data$revenue <- data$Quantity * data$UnitPrice
revenue <- aggregate(revenue ~ CustomerID, data = data, FUN = sum)
customers <- merge(customers,revenue, all = TRUE)
remove(revenue)
summary(customers)
CustomerID recency Frequency revenue
Min. :12346 Min. : 1.00 Min. : 1.000 Min. : -4287.6
1st Qu.:14208 1st Qu.: 18.06 1st Qu.: 1.000 1st Qu.: 282.2
Median :15572 Median : 51.08 Median : 2.000 Median : 627.1
Mean :15562 Mean : 92.73 Mean : 4.246 Mean : 1713.4
3rd Qu.:16914 3rd Qu.:143.09 3rd Qu.: 5.000 3rd Qu.: 1521.8
Max. :18287 Max. :374.12 Max. :210.000 Max. :256438.5
NA's :29 NA's :29
We have 29 customers with NA values. These are customers who have cancelled orders in the original data but no purchases, so recency and frequency cannot be computed for them. We will not consider these customers and will remove them.
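Those NAs come from merging with all = TRUE (a full outer join): an ID present in the customer list but absent from the purchases-only recency table gets NA. A minimal sketch with hypothetical IDs:

```r
ids       <- data.frame(CustomerID = c(1, 2, 3))
purchases <- data.frame(CustomerID = c(1, 2), recency = c(5, 12))
merged <- merge(ids, purchases, all = TRUE)  # full outer join on CustomerID
merged            # customer 3 gets recency NA: only cancelled orders, no purchases
na.omit(merged)   # drops that customer
```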
customers <- na.omit(customers)
summary(customers)
CustomerID recency Frequency revenue
Min. :12346 Min. : 1.00 Min. : 1.000 Min. : -1165.3
1st Qu.:14208 1st Qu.: 18.06 1st Qu.: 1.000 1st Qu.: 289.8
Median :15569 Median : 51.08 Median : 2.000 Median : 633.7
Mean :15561 Mean : 92.73 Mean : 4.246 Mean : 1728.4
3rd Qu.:16913 3rd Qu.:143.09 3rd Qu.: 5.000 3rd Qu.: 1530.8
Max. :18287 Max. :374.12 Max. :210.000 Max. :256438.5
#Observations
There are negative revenue values: probably customers who purchased something before the given time frame but returned it within the time frame. Setting these values to zero.
customers$revenue[customers$revenue < 0] <- 0
This looks much better now. A quick check that no customer has a recency of zero, which would break the log transform later:
sum(customers$recency == 0)
[1] 0
#---------------- PRE-PROCESSING DATA ---------------------
Before evaluating outliers, let's do a Pareto analysis to understand why outliers can be a crucial factor here.
#----------- The 80/20 rule / Pareto Analysis ------------
Pareto Analysis: the rule says that, more or less, 80% of the results come from 20% of the causes. In this context, around 80% of sales should come from 20% of the customers; that is, the top 20% of customers contribute most to sales. These would be our high-value customers. Let's see if this holds here.
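The cumulative-sum trick used for this is easiest to see on a toy vector of made-up revenues, sorted in decreasing order:

```r
rev    <- c(500, 300, 100, 60, 40)   # hypothetical revenues, total = 1000
cutoff <- 0.8 * sum(rev)             # 80% of total revenue = 800
top    <- cumsum(rev) <= cutoff      # TRUE while cumulative revenue stays within 80%
top          # TRUE TRUE FALSE FALSE FALSE
mean(top)    # 0.4 -> here 40% of customers cover 80% of the revenue
```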
pareto_cutoff <- 0.8*sum(customers$revenue)
customers <- customers[order(customers$revenue, decreasing = TRUE),]
customers$pareto <- ifelse(cumsum(customers$revenue) <= pareto_cutoff,"Top 20%","Bottom 80%")
customers$pareto <- as.factor(customers$pareto)
round(prop.table(table(customers$pareto)),2)
Bottom 80% Top 20%
0.71 0.29
This is closer to a 70/30 split than the classic 80/20, but the idea holds: around 29% of our customers contribute roughly 80% of the sales (revenue).
customers <- customers[order(customers$revenue),]
#Let's look at it visually!
p1 <- ggplot(data = customers) +
  geom_point(aes(x = Frequency, y = revenue, color = recency, shape = pareto)) +
  ggtitle("Original Values") +
  scale_shape_manual(name = "80/20 rule", values = c(3, 8)) +
  theme(legend.position = "top")
ggplotly(p1)
This plot is very hard to read: almost all of the bottom 80% of customers are bunched together in the lower-left corner. We need to evaluate the RFM variables for skewness and deal with them accordingly.
#------How skewed is our data?---------
ggplot(data = customers, aes(x = recency)) +
geom_histogram()
ggplot(data = customers, aes(x = Frequency)) +
geom_histogram()
ggplot(data = customers, aes(x = revenue)) +
geom_histogram()
Let's also look at it mathematically.
skewness(customers$recency)
[1] 1.244409
skewness(customers$Frequency)
[1] 10.80211
skewness(customers$revenue)
[1] 23.27612
All three variables are positively skewed. Let's take the log transform of each, then scale and centre the data.
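To see why the log helps, here is a self-contained sketch on simulated lognormal data; sample skewness is computed by hand (a simplified version of what moments::skewness reports):

```r
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)  # sample skewness
set.seed(42)
x <- exp(rnorm(10000))  # heavily right-skewed toy data (lognormal)
skew(x)                 # strongly positive
skew(log(x))            # close to zero: the log restores near-symmetry
```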
customers$log_recency <- log(customers$recency)
customers$log_frequency <- log(customers$Frequency)
customers$revenue <- customers$revenue + 0.01 #revenue contains zeros; shift slightly to avoid log(0)
customers$log_revenue <- log(customers$revenue)
#Lets scale the data using Z- scores!
customers$z_recency <- scale(customers$log_recency, scale = TRUE, center = TRUE)
customers$z_frequency <- scale(customers$log_frequency, scale = TRUE, center = TRUE)
customers$z_revenue <- scale(customers$log_revenue, scale = TRUE, center = TRUE)
Lets look at the result now!
#Visualizing the log transformed and scaled variables!
p2 <- ggplot(data = customers) +
  geom_point(aes(x = log_frequency, y = log_revenue, color = log_recency, shape = pareto)) +
  ggtitle("Log Transformed Values") +
  scale_shape_manual(name = "80/20 rule", values = c(1, 2))
ggplotly(p2)
This is much easier to read. We can now see how the top 20% stand out on log_frequency, log_revenue and log_recency.
#Top right corner - quite a few outliers: our high-frequency, high-revenue, fairly recent customers
#Bottom left corner - outliers too, with low frequency and low revenue
Lets see the results for scaled variables too.
p3 <- ggplot(data = customers) +
  geom_point(aes(x = z_frequency, y = z_revenue, color = z_recency, shape = pareto)) +
  ggtitle("Scaled Values") +
  scale_shape_manual(name = "80/20 rule", values = c(1, 2))
ggplotly(p3)
Both plots look very similar, but the data is now easier to analyze and the outliers are clearly visible.
A single point can represent multiple customers. Let's see how many customers sit at the lowest points.
unique(customers$CustomerID[customers$log_revenue < 0] )
[1] 12346 13256 13364 13672 13762 14437 14557 14792 15802 15823 16454 16546 16742 16878 17548 17603 18072 18268
[19] 18274
19 customers bring very little value to the business.
#-------------------------- MODELLING: K-MEANS ALGORITHM ------------------------
###---- WHY? ----------
K-means produces disjoint sets: I wanted each customer to belong to one and only one segment. It also scales roughly linearly with the number of points, as opposed to hierarchical clustering, whose standard implementations are at least quadratic, O(n^2).
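A quick sanity check on simulated data that stats::kmeans assigns every point to exactly one cluster:

```r
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # toy cluster around (0, 0)
             matrix(rnorm(40, mean = 5), ncol = 2))   # toy cluster around (5, 5)
km <- kmeans(pts, centers = 2, nstart = 20)
length(km$cluster)  # one label per point: the segments are disjoint
table(km$cluster)   # every point counted exactly once
```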
###--- Dealing with outliers ----------
Although k-means is sensitive to outliers, in our case these outliers carry useful information about our customers (the very high- and very low-value ones). Therefore, we will not remove them.
#---------------- OPTIMAL NUMBER OF CLUSTERS --------------------------
#k-means requires us to specify the number of clusters up front.
#To find the optimal number, we can use:
#1. The elbow method
#2. The silhouette method
#3. The gap-statistic method
#Goal is to group customers into high value and low value customers!
#Select the scaled RFM columns by name (positional indexing is fragile here)
pre_processed <- customers[, c("z_recency", "z_frequency", "z_revenue")]
fviz_nbclust(pre_processed, kmeans,method = "wss")
#Therefore, the elbow method shows 4 as the optimal number of clusters
fviz_nbclust(pre_processed, kmeans,method = "silhouette")
#The Silhouette method gives 3 as the optimal number of clusters!
#fviz_nbclust(pre_processed, kmeans,method = "gap_stat")
#Gap statistic also gives 3 as the optimal number of clusters!
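fviz_nbclust is essentially a wrapper around a loop like the following base-R sketch (run on simulated data here, since it only illustrates the mechanics):

```r
set.seed(7)
toy <- scale(rbind(matrix(rnorm(60, mean = 0), ncol = 3),   # two well-separated
                   matrix(rnorm(60, mean = 4), ncol = 3)))  # toy clusters
wss <- sapply(1:8, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# the "elbow" is where adding another cluster stops reducing wss sharply
```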
#Let's visualize the results for 1 to 10 clusters and compare!
models <- data.frame(k=integer(),
tot.withinss=numeric(),
betweenss=numeric(),
totss=numeric(),
rsquared=numeric())
for (k in 1:10){
output <- kmeans(pre_processed, centers = k,nstart = 20)
variable_name <- paste("Cluster",k,sep = "_")
customers[,(variable_name)] <- output$cluster
customers[,(variable_name)] <- factor(customers[,(variable_name)], levels = c(1:k))
#Graphing the clusters
colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')
title <- paste("Cluster analysis with",k,"Clusters")
p4 <- ggplot(data = customers, aes(x = log_frequency, y = log_revenue)) +
  geom_point(aes(color = customers[,(variable_name)])) +
  labs(color = "Cluster group") +
  ggtitle(title)
print(ggplotly(p4))
#Cluster centres in original metrics
cluster_centres <- ddply(customers,.(customers[,(variable_name)]), summarize,
revenue = round(median(revenue),2),
recency = round(median(recency),2),
frequency = round(median(Frequency),2)
)
names(cluster_centres)[names(cluster_centres) == "customers[, (variable_name)]"] <- "Clusters"
print(cluster_centres)
cat("\n")
cat("\n")
#Model Information
models[k,"k"] <- k
models[k,"tot.withinss"] <- output$tot.withinss
models[k,"betweenss"] <- output$betweenss
models[k,"totss"] <- output$totss
models[k,"rsquared"] <- output$betweenss/output$totss
}
## Warning: did not converge in 10 iterations
Final Solution
scatter3d(x = customers$log_revenue, y = customers$log_frequency, z = customers$log_recency,
groups = customers$Cluster_4,
xlab = "Frequency (Log-transformed)",
ylab = "Monetary Value (log-transformed)",
zlab = "Recency (Log-transformed)",
surface.col = colors,
axis.scales = FALSE,
surface = FALSE, # TRUE would draw regression surfaces through the plot
fit = "smooth",
ellipsoid = TRUE, # draw a concentration ellipsoid around each cluster instead
grid = FALSE,
axis.col = c("black", "black", "black"))
#----------- Final Analysis ------------------
#As the number of clusters increases, their interpretability suffers.
#Cluster analysis with 2 clusters seems overly simplified.
#With 3 clusters, cluster 3 holds the high-value customers, while clusters 1 and 2 are similar (low value).
#The decision should be based on how the business plans to use the results and the level of granularity it wants in the clusters. I'd suggest 4 clusters: Cluster 4 holds the high-value customers; Clusters 1 and 2 the mid-value customers; and Cluster 3 the near-zero-value customers with low frequency, low revenue, and no recent purchases.